which one is lowly transcribed or even not transcribed at all. RNA-Seq count data are also
used as an alternative to microarray data in eQTL analysis. QTL analysis is a statistical
method that links phenotypic data (trait measurements) with genotypic data (markers, usually
SNPs) in an attempt to explain the genetic basis of variation in complex traits. eQTL
analysis applies the same idea to gene expression: it links markers (genotype) with gene
expression levels measured in a large number of individuals, and the data are modeled using
generalized linear models.
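As a rough illustration of how a generalized linear model can link a marker to expression counts, the sketch below fits a Poisson regression with a log link to the read counts of one gene against the genotype at one SNP. It is a minimal sketch under stated assumptions, not the method of any particular eQTL tool: the function name fit_poisson_eqtl, the 0/1/2 allele coding, and the toy counts are all hypothetical, and real analyses usually add normalization, covariates, and a model that allows for overdispersion (for example, the negative binomial).

```python
import math

def fit_poisson_eqtl(genotypes, counts, n_iter=25):
    """Fit log(E[count]) = b0 + b1 * genotype by iteratively reweighted
    least squares (IRLS), the standard fitting scheme for GLMs.

    genotypes: number of alternate alleles per individual (0, 1, or 2).
    counts:    RNA-Seq read counts of one gene in the same individuals.
    Returns (b0, b1); exp(b1) is the fold change in expression per allele.
    """
    b0 = math.log(sum(counts) / len(counts))  # start from the intercept-only fit
    b1 = 0.0
    for _ in range(n_iter):
        # Accumulate the weighted normal equations for the two coefficients.
        s_w = s_wx = s_wxx = s_wz = s_wxz = 0.0
        for g, y in zip(genotypes, counts):
            eta = b0 + b1 * g           # linear predictor
            mu = math.exp(eta)          # expected count under the current fit
            z = eta + (y - mu) / mu     # working response
            w = mu                      # IRLS weight for Poisson with log link
            s_w += w
            s_wx += w * g
            s_wxx += w * g * g
            s_wz += w * z
            s_wxz += w * g * z
        # Solve the 2x2 system by Cramer's rule.
        det = s_w * s_wxx - s_wx * s_wx
        b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        b1 = (s_w * s_wxz - s_wx * s_wz) / det
    return b0, b1

# Hypothetical toy data: six individuals, two per genotype class.
genotypes = [0, 0, 1, 1, 2, 2]
counts = [8, 12, 18, 22, 35, 45]
b0, b1 = fit_poisson_eqtl(genotypes, counts)
fold_change_per_allele = math.exp(b1)
```

In this toy data set the mean count doubles with each additional allele (10, 20, 40), so the fitted fold change per allele is 2. In an actual eQTL scan the same fit would be repeated for many marker-gene pairs, followed by correction for multiple testing.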
RNA-Seq is a powerful tool for detecting alternative splicing patterns, which are important
for understanding the development of human diseases. Paired-end sequencing provides sequence
information from both ends of each fragment and helps detect splicing patterns without
requiring prior knowledge of transcript annotations. Single-molecule, real-time (SMRT)
sequencing is a core technology behind long-read sequencing; by generating full-length
transcript sequences, it allows examination of splicing patterns and transcript
connectivity at genome scale.
RNA-Seq is also used for fusion gene detection. A fusion gene is a gene formed by joining
parts of two different genes. It is usually created when a gene from one chromosome moves
to another chromosome (a translocation). The fusion gene is transcribed into mRNA, which
may be translated into a fusion protein. Fusion proteins are implicated in some types of
cancer, including leukemia; soft tissue sarcoma; cancers of the prostate, breast, lung,
bladder, colon, and rectum; and CNS tumors. Paired-end RNA-Seq data are usually used for
fusion gene detection [4].
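The reason paired-end data suit fusion detection can be sketched in a few lines: if the two mates of a read pair align to two different genes, the fragment they came from may span a fusion junction. The example below is a simplified illustration, not the algorithm of any specific fusion caller. It assumes that alignment and gene annotation have already been done and that each read pair has been reduced to the pair of gene names its mates map to; the function name fusion_candidates, the min_support threshold, and the toy input are hypothetical.

```python
from collections import Counter

def fusion_candidates(pair_genes, min_support=2):
    """Count read pairs whose two mates map to different genes.

    pair_genes:  iterable of (gene_of_mate1, gene_of_mate2); None marks an
                 unmapped or unannotated mate.
    min_support: minimum number of discordant pairs required to report
                 a gene pair as a candidate.
    Returns a dict {(geneA, geneB): supporting_pair_count}, with the gene
    names of each pair sorted alphabetically.
    """
    support = Counter()
    for gene1, gene2 in pair_genes:
        if gene1 is None or gene2 is None or gene1 == gene2:
            continue  # concordant or uninformative pair
        support[tuple(sorted((gene1, gene2)))] += 1
    return {pair: n for pair, n in support.items() if n >= min_support}

# Hypothetical toy input: one (mate1 gene, mate2 gene) tuple per read pair.
pairs = [
    ("BCR", "ABL1"), ("ABL1", "BCR"), ("BCR", "BCR"), ("TP53", "TP53"),
    ("EML4", "ALK"), ("BCR", "ABL1"), ("GAPDH", None),
]
candidates = fusion_candidates(pairs)
```

Here only the BCR/ABL1 pair reaches the support threshold. Practical tools typically also use split reads that cross the junction itself and apply filters (for example, against paralogous genes and read-through transcripts) before reporting a fusion.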
Other kinds of RNA-Seq applications include integration of RNA-Seq data analysis with
other technologies.
The library preparation for mRNA is similar to that for DNA. However, mRNA must first
be separated from other types of RNA by an enrichment technique, typically either poly(A)
selection or depletion of the other types of RNA (mainly ribosomal RNA). The RNA must then
be converted into complementary DNA (cDNA) by reverse transcription before library
preparation. As with a DNA library, cDNA library preparation involves fragmentation and
adaptor ligation to each end of the fragments. The cDNA fragments are then sequenced on the
sequencing machine, and the sequencing can be either single end (each fragment read from
one end only) or paired end (each fragment read from both ends). The sequencing generates
sequence data in the form of reads in FASTQ files. Those reads are the sequenced fragments
of the expressed genes in the sample.
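To make the FASTQ output concrete, the following sketch reads records in the common four-line layout (an "@" header line, the sequence, a "+" separator, and a quality string) and decodes the quality characters into Phred scores. It assumes the Phred+33 encoding used by current Illumina instruments and unwrapped, uncompressed text; the function name read_fastq and the two example reads are made up for illustration.

```python
import io

def read_fastq(handle):
    """Yield (read_id, sequence, phred_scores) from a four-line FASTQ stream.

    Assumes Phred+33 quality encoding and one line each for the sequence
    and the quality string.
    """
    while True:
        header = handle.readline().rstrip()
        if not header:
            break  # end of file (or a trailing blank line)
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip()
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        # Each quality character encodes one Phred score as ASCII code - 33.
        yield header[1:], seq, [ord(c) - 33 for c in qual]

# Hypothetical two-read example held in memory instead of a file.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!5I\n")
records = list(read_fastq(example))
```

For paired-end data the forward and reverse reads are commonly stored in two files with matching record order, so the same function could be applied to both files and the results combined with zip.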
5.3 RNA-SEQ DATA ANALYSIS WORKFLOW
The first steps of RNA-Seq data analysis are the same as in other sequencing applications.
The raw sequencing data (usually in FASTQ files) must pass through the quality control steps
that were discussed in detail in Chapter 1. In general, the steps of the workflow include
quality control, read alignment, read quantification, differential expression, annotation,
and interpretation (Figure 5.1).
5.3.1 Acquiring RNA-Seq Data
The RNA-Seq raw data are sequence reads produced by a sequencing instrument. Raw RNA-Seq
data for some projects are available in public databases and can be